Report
Introduction
As part of the course “Projects in Data Analytics for Decision Making”, we are given the task of analysing the “German Credit” dataset. The aim of the project is to obtain a model that may be used to determine if new applicants present a good or bad credit risk.
Exploratory Data Analysis
Structure & Summary
We start our exploratory data analysis by understanding the data. First, let us look at the types of the variables in the dataset:
str(GermanCredit)
#> 'data.frame': 1000 obs. of 31 variables:
#> $ CHK_ACCT : int 0 1 3 0 0 3 3 1 3 1 ...
#> $ DURATION : int 6 48 12 42 24 36 24 36 12 30 ...
#> $ HISTORY : int 4 2 4 2 3 2 2 2 2 4 ...
#> $ NEW_CAR : int 0 0 0 0 1 0 0 0 0 1 ...
#> $ USED_CAR : int 0 0 0 0 0 0 0 1 0 0 ...
#> $ FURNITURE : int 0 0 0 1 0 0 1 0 0 0 ...
#> $ RADIO.TV : int 1 1 0 0 0 0 0 0 1 0 ...
#> $ EDUCATION : int 0 0 1 0 0 1 0 0 0 0 ...
#> $ RETRAINING : int 0 0 0 0 0 0 0 0 0 0 ...
#> $ AMOUNT : int 1169 5951 2096 7882 4870 9055 2835 6948 3..
#> $ SAV_ACCT : int 4 0 0 0 0 4 2 0 3 0 ...
#> $ EMPLOYMENT : int 4 2 3 3 2 2 4 2 3 0 ...
#> $ INSTALL_RATE : int 4 2 2 2 3 2 3 2 2 4 ...
#> $ MALE_DIV : int 0 0 0 0 0 0 0 0 1 0 ...
#> $ MALE_SINGLE : int 1 0 1 1 1 1 1 1 0 0 ...
#> $ MALE_MAR_or_WID : int 0 0 0 0 0 0 0 0 0 1 ...
#> $ CO.APPLICANT : int 0 0 0 0 0 0 0 0 0 0 ...
#> $ GUARANTOR : int 0 0 0 1 0 0 0 0 0 0 ...
#> $ PRESENT_RESIDENT: int 4 2 3 4 4 4 4 2 4 2 ...
#> $ REAL_ESTATE : int 1 1 1 0 0 0 0 0 1 0 ...
#> $ PROP_UNKN_NONE : int 0 0 0 0 1 1 0 0 0 0 ...
#> $ AGE : int 67 22 49 45 53 35 53 35 61 28 ...
#> $ OTHER_INSTALL : int 0 0 0 0 0 0 0 0 0 0 ...
#> $ RENT : int 0 0 0 0 0 0 0 1 0 0 ...
#> $ OWN_RES : int 1 1 1 0 0 0 1 0 1 1 ...
#> $ NUM_CREDITS : int 2 1 1 1 2 1 1 1 1 2 ...
#> $ JOB : int 2 2 1 2 2 1 2 3 1 3 ...
#> $ NUM_DEPENDENTS : int 1 1 2 2 2 2 1 1 1 1 ...
#> $ TELEPHONE : int 1 0 0 0 0 1 0 1 0 0 ...
#> $ FOREIGN : int 0 0 0 0 0 0 0 0 0 0 ...
#> $ RESPONSE : int 1 0 1 1 0 1 1 1 1 0 ...

We note that all the variables are stored as integers. However, from the data description, many of them are actually categorical. Hence, in order to obtain consistent results in our analysis, we transform these variables from integers to factors.
To transform the integer variables into proper categorical ones, we use a for loop. While analysing the data, we noted that the variable “EDUCATION” has one outlier: in one observation it equals “-1” instead of 0/1. For this reason, we recode the value of this observation.
Moreover, we noted that there are outliers in the variables “AGE” and “GUARANTOR” as well. For the former, we changed the observation with a value of 125 to 75, while for “GUARANTOR” we changed one observation from 2 to 1.
# changing education: recode the single "-1" observation to 1
GermanCredit$EDUCATION[GermanCredit$EDUCATION == -1] <- 1
# changing guarantor: recode the single "2" observation to 1
GermanCredit$GUARANTOR[GermanCredit$GUARANTOR == 2] <- 1
# changing age: recode the single "125" observation to 75
GermanCredit$AGE[GermanCredit$AGE == 125] <- 75
for (i in 1:31) {
  if (is.integer(GermanCredit[, i]) & i != 2 & i != 10 & i != 22) {
    GermanCredit[, i] <- as.factor(GermanCredit[, i])
  }
}
for (i in c(2, 10, 22)) {
  GermanCredit[, i] <- as.numeric(GermanCredit[, i])
}
str(GermanCredit)
#> 'data.frame': 1000 obs. of 31 variables:
#> $ CHK_ACCT : Factor w/ 4 levels "0","1","2","3": 1 2 4 1 1 ..
#> $ DURATION : num 6 48 12 42 24 36 24 36 12 30 ...
#> $ HISTORY : Factor w/ 5 levels "0","1","2","3",..: 5 3 5 3..
#> $ NEW_CAR : Factor w/ 2 levels "0","1": 1 1 1 1 2 1 1 1 1 ..
#> $ USED_CAR : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 2 1 ..
#> $ FURNITURE : Factor w/ 2 levels "0","1": 1 1 1 2 1 1 2 1 1 ..
#> $ RADIO.TV : Factor w/ 2 levels "0","1": 2 2 1 1 1 1 1 1 2 ..
#> $ EDUCATION : Factor w/ 2 levels "0","1": 1 1 2 1 1 2 1 1 1 ..
#> $ RETRAINING : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 ..
#> $ AMOUNT : num 1169 5951 2096 7882 4870 ...
#> $ SAV_ACCT : Factor w/ 5 levels "0","1","2","3",..: 5 1 1 1..
#> $ EMPLOYMENT : Factor w/ 5 levels "0","1","2","3",..: 5 3 4 4..
#> $ INSTALL_RATE : Factor w/ 4 levels "1","2","3","4": 4 2 2 2 3 ..
#> $ MALE_DIV : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 2 ..
#> $ MALE_SINGLE : Factor w/ 2 levels "0","1": 2 1 2 2 2 2 2 2 1 ..
#> $ MALE_MAR_or_WID : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 ..
#> $ CO.APPLICANT : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 ..
#> $ GUARANTOR : Factor w/ 2 levels "0","1": 1 1 1 2 1 1 1 1 1 ..
#> $ PRESENT_RESIDENT: Factor w/ 4 levels "1","2","3","4": 4 2 3 4 4 ..
#> $ REAL_ESTATE : Factor w/ 2 levels "0","1": 2 2 2 1 1 1 1 1 2 ..
#> $ PROP_UNKN_NONE : Factor w/ 2 levels "0","1": 1 1 1 1 2 2 1 1 1 ..
#> $ AGE : num 67 22 49 45 53 35 53 35 61 28 ...
#> $ OTHER_INSTALL : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 ..
#> $ RENT : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 2 1 ..
#> $ OWN_RES : Factor w/ 2 levels "0","1": 2 2 2 1 1 1 2 1 2 ..
#> $ NUM_CREDITS : Factor w/ 4 levels "1","2","3","4": 2 1 1 1 2 ..
#> $ JOB : Factor w/ 4 levels "0","1","2","3": 3 3 2 3 3 ..
#> $ NUM_DEPENDENTS : Factor w/ 2 levels "1","2": 1 1 2 2 2 2 1 1 1 ..
#> $ TELEPHONE : Factor w/ 2 levels "0","1": 2 1 1 1 1 2 1 2 1 ..
#> $ FOREIGN : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 ..
#> $ RESPONSE : Factor w/ 2 levels "0","1": 2 1 2 2 1 2 2 2 2 ..
datatable(head(GermanCredit), class = "cell-border stripe", options = list(scrollX = TRUE))

Now that the variables are in the correct form, we can move on with our analysis.
We start by looking at the structure of the dataset.
plot_intro(GermanCredit, title = "Dataset structure")
types<-inspect_types(GermanCredit)
show_plot(types, col_palette=2)
From the first plot, we note that there are no missing observations. Moreover, after the variable transformation, 87% of the columns hold discrete values, meaning that they are categorical variables. This is confirmed by the second plot, which gives the exact number of numeric columns and of factor (categorical) columns.
We are now going to dive deeper into the variables’ analysis.
summary(GermanCredit)
#> CHK_ACCT DURATION HISTORY NEW_CAR USED_CAR FURNITURE RADIO.TV
#> 0:274 Min. : 4.0 0: 40 0:766 0:897 0:819 0:720
#> 1:269 1st Qu.:12.0 1: 49 1:234 1:103 1:181 1:280
#> 2: 63 Median :18.0 2:530
#> 3:394 Mean :20.9 3: 88
#> 3rd Qu.:24.0 4:293
#> Max. :72.0
#> EDUCATION RETRAINING AMOUNT SAV_ACCT EMPLOYMENT
#> 0:950 0:903 Min. : 250 0:603 0: 62
#> 1: 50 1: 97 1st Qu.: 1366 1:103 1:172
#> Median : 2320 2: 63 2:339
#> Mean : 3271 3: 48 3:174
#> 3rd Qu.: 3972 4:183 4:253
#> Max. :18424
#> INSTALL_RATE MALE_DIV MALE_SINGLE MALE_MAR_or_WID CO.APPLICANT
#> 1:136 0:950 0:452 0:908 0:959
#> 2:231 1: 50 1:548 1: 92 1: 41
#> 3:157
#> 4:476
#>
#>
#> GUARANTOR PRESENT_RESIDENT REAL_ESTATE PROP_UNKN_NONE
#> 0:948 1:130 0:718 0:846
#> 1: 52 2:308 1:282 1:154
#> 3:149
#> 4:413
#>
#>
#> AGE OTHER_INSTALL RENT OWN_RES NUM_CREDITS JOB
#> Min. :19.0 0:814 0:821 0:287 1:633 0: 22
#> 1st Qu.:27.0 1:186 1:179 1:713 2:333 1:200
#> Median :33.0 3: 28 2:630
#> Mean :35.5 4: 6 3:148
#> 3rd Qu.:42.0
#> Max. :75.0
#> NUM_DEPENDENTS TELEPHONE FOREIGN RESPONSE
#> 1:845 0:596 0:963 0:300
#> 2:155 1:404 1: 37 1:700
#>
#>
#>
#>

The summary gives us a first overview of the variables’ details. Interestingly, some categorical variables present an unequal distribution of values. For instance, the variable “CO.APPLICANT” takes the value “0” 959 times and “1” only 41 times. A similar comment can be made for other categorical variables such as “MALE_MAR_or_WID”, “FOREIGN”, “MALE_DIV” and “EDUCATION”.
Let us now analyse the numerical variables. For those, we can look at the minimum and the maximum, which gives us a range. For instance, we see that in this bank the duration of the loan goes from 4 to 72 months. The larger the range, the higher the chance of having outliers. For example, the “AMOUNT” variable has a very wide range: from 250 to 18,424. Moreover, we note that for all the numerical variables the median is lower than the mean. For instance, “AMOUNT” has a median of 2320 and a mean of 3271, meaning that its histogram will show a long right tail (positive skewness).
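The bar plots of the factor variables discussed next are not shown as code in this chunk; a minimal sketch of how they could be produced, assuming the DataExplorer package already used for plot_intro and plot_density (the original call may differ):

```r
# Bar plots of all discrete (factor) variables
plot_bar(GermanCredit, ggtheme = theme_bw())
```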
This plot gives us a better visualization of how the factor variables are distributed. We confirm that for most of them, the number of instances in each level is unbalanced. The only variable that seems to have a similar distribution of 1 and 0 is MALE_SINGLE.
Histogram & Density
num.var<-inspect_num(GermanCredit)
show_plot(num.var, col_palette=3)
plot_density(GermanCredit[,-31],
             ggtheme = theme_bw())

We now comment on the histogram of each numerical variable. First, let us look at AGE. Most of the people taking a credit are between their 20s and 40s. However, the age range is quite large and it is not uncommon in this dataset to see people over 60 obtaining a credit.
Second, we analyse AMOUNT. This variable is mostly concentrated between 0 and 5,000. It appears that high credits (> 10,000) are rare in this dataset.
Third, we look at the DURATION of the credits. Most of them last between roughly 5 and 20 months. The duration can sometimes be much longer, and we can imagine that this is the case for the high credits.
Lastly, by observing the histograms we note that all three variables are right-skewed. This suggests a notable difference between the mean and the median; looking back at the summary, we see that the mean and the median of all three variables are indeed quite different.
Side-by-side boxplots
lblue <- "#6699CC"
par(mfrow = c(1,3))
boxplot(AGE ~ RESPONSE, data = GermanCredit, xlab = "Response", notch = T,
varwidth = T, col = lblue)
boxplot(AMOUNT ~ RESPONSE, data = GermanCredit, xlab = "Response", varwidth = T, col = lblue)
boxplot(DURATION ~ RESPONSE, data = GermanCredit, xlab = "Response", varwidth = T, col = lblue)
mtext("Side-by-side Boxplots", side = 3, line = -1.5, outer = TRUE)

The boxplots confirm what was stated before. For each variable, we see in which range most values lie according to the outcome variable RESPONSE. Moreover, they give a visual representation of possible outliers: for instance, we see that for AMOUNT there is one observation where the credit is very high and the outcome variable RESPONSE is 0.
The boxplots also allow us to compare the median of each variable according to RESPONSE. On the one hand, AMOUNT has almost the same median regardless of the outcome. On the other hand, both AGE and DURATION show a slightly different median depending on RESPONSE. Moreover, the range of each variable also differs by outcome, as we can clearly see for DURATION. More specifically, the range for RESPONSE = 1 is smaller for both AMOUNT and DURATION, and roughly equal to that of RESPONSE = 0 for AGE.
Good vs Bad credits
GermanCredit %>%
select(TELEPHONE, RADIO.TV, RESPONSE) %>%
explore_all(target = RESPONSE)
GermanCredit %>%
select(NEW_CAR, USED_CAR,FURNITURE,EDUCATION, RESPONSE) %>%
explore_all(target = RESPONSE)
GermanCredit %>%
select(FOREIGN, MALE_DIV, MALE_SINGLE, MALE_MAR_or_WID, RESPONSE) %>%
explore_all(target = RESPONSE)
GermanCredit %>%
select(GUARANTOR, CO.APPLICANT, PROP_UNKN_NONE, OTHER_INSTALL, RESPONSE) %>%
explore_all(target = RESPONSE)
GermanCredit %>%
select(REAL_ESTATE, RENT, OWN_RES, RETRAINING, RESPONSE) %>%
explore_all(target = RESPONSE)
GermanCredit %>%
select(NUM_CREDITS, NUM_DEPENDENTS, INSTALL_RATE, RESPONSE) %>%
explore_all(target = RESPONSE)

Similarly to what we saw in the boxplots for the numerical variables, we now observe the distribution of all the categorical ones according to the outcome variable RESPONSE.
We note that no single categorical variable discriminates clearly between the outcomes, meaning that we will not be able to determine RESPONSE by looking at any of these variables individually.
After having deeply analysed the dataset, we are now going to look at the correlation between the variables and then proceed with the model analysis.
Correlations
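The correlation matrix discussed below does not appear as code in this chunk; a sketch of how it could be computed for the three numerical variables (the original call may differ):

```r
# Pearson correlations among the numerical variables
round(cor(GermanCredit[, c("DURATION", "AMOUNT", "AGE")]), 2)
```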
From the correlation matrix we see that only “DURATION” and “AMOUNT” are noticeably correlated, and positively so. When building the models we should keep this correlation in mind and possibly compute the VIF coefficients in order to make sure that there are no multicollinearity problems.
There seems to be no correlation between the other variables, which is good news: these variables should not cause multicollinearity problems.
We are now going to proceed with the modeling section.
Boruta Algorithm
Variable selection is an important part of the model building process. To decide which variables to use, we employ the Boruta algorithm. Contrary to traditional feature selection methods, which rely on a small subset of features, this algorithm captures all features relevant to the outcome variable. (Source: https://www.analyticsvidhya.com/blog/2016/03/select-important-variables-boruta-package/)
library(Boruta)
set.seed(1)
boruta.train <- Boruta(RESPONSE ~., data =GermanCredit, doTrace = 2)
print(boruta.train)
#> Boruta performed 99 iterations in 1.73 mins.
#> 14 attributes confirmed important: AGE, AMOUNT, CHK_ACCT,
#> DURATION, EMPLOYMENT and 9 more;
#> 12 attributes confirmed unimportant: CO.APPLICANT,
#> EDUCATION, FOREIGN, FURNITURE, MALE_DIV and 7 more;
#> 4 tentative attributes left: INSTALL_RATE, JOB, NUM_CREDITS,
#> RENT;

According to the Boruta algorithm, 14 attributes are confirmed as important, 12 as unimportant, and 4 are left as tentative. The latter means that the algorithm is unable to decide whether those variables are important or not.
Plotting the most important variables
By plotting the most important variables according to the Boruta algorithm, one can observe the individual importance of each feature. For instance, CHK_ACCT appears as the most important one, followed by DURATION and HISTORY. The green variables are the ones considered important, while the red ones are the unimportant features. Additionally, the yellow ones are the four tentative variables mentioned previously: INSTALL_RATE, JOB, NUM_CREDITS and RENT.
plot(boruta.train, xlab = "", xaxt = "n")
lz<-lapply(1:ncol(boruta.train$ImpHistory),function(i)
boruta.train$ImpHistory[is.finite(boruta.train$ImpHistory[,i]),i])
names(lz) <- colnames(boruta.train$ImpHistory)
Labels <- sort(sapply(lz,median))
axis(side = 1,las=2,labels = names(Labels),
at = 1:ncol(boruta.train$ImpHistory), cex.axis = 0.7)
title(main="Variable Importance According to the Boruta Algorithm")

Tentative Rough Fix
We use the TentativeRoughFix() function in order to make a decision on the tentative variables and classify them as either important or unimportant.
set.seed(2)
final.boruta <- TentativeRoughFix(boruta.train)
print(final.boruta)
#> Boruta performed 99 iterations in 1.73 mins.
#> Tentatives roughfixed over the last 99 iterations.
#> 17 attributes confirmed important: AGE, AMOUNT, CHK_ACCT,
#> DURATION, EMPLOYMENT and 12 more;
#> 13 attributes confirmed unimportant: CO.APPLICANT,
#> EDUCATION, FOREIGN, FURNITURE, MALE_DIV and 8 more;

Plotting the variable importance after the fix
plot(final.boruta, xlab = "", xaxt = "n")
lz<-lapply(1:ncol(final.boruta$ImpHistory),function(i)
final.boruta$ImpHistory[is.finite(final.boruta$ImpHistory[,i]),i])
names(lz) <- colnames(final.boruta$ImpHistory)
Labels <- sort(sapply(lz,median))
axis(side = 1,las=2,labels = names(Labels),
at = 1:ncol(final.boruta$ImpHistory), cex.axis = 0.45)
title(main="Variable Importance After the Fix")

As we can see, INSTALL_RATE, JOB and NUM_CREDITS are now classified as important features, whereas RENT has been classified as unimportant.
Focus on Important Variables
Now that we have selected the features to use, namely the important variables, we filter the dataset to keep only those attributes.
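The filtering step itself is not shown; a sketch using Boruta's getSelectedAttributes() to keep only the confirmed features plus the outcome (the variable name `keep` is ours, not from the original code):

```r
keep <- getSelectedAttributes(final.boruta)         # names of the 17 confirmed attributes
GermanCredit <- GermanCredit[, c(keep, "RESPONSE")] # retain them together with the outcome
```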
Partition of Data
To begin the modeling section, we start by partitioning the data set into a training set (750 observations) and testing set (250 observations), selected at random.
row.order <- sample(c(1:1000)) # first randomize the order of the rows
german.tr <- GermanCredit[row.order[1:750],] # take the first 750 (random) rows of german for the training set
german.te <- GermanCredit[row.order[751:1000],] # the remaining 250 rows form the test set

In order to be consistent, we decided to build all models using the caret package.
Decision Tree
The first model we investigate is the decision tree. The model is built on the training set, and the test set is then used to measure its prediction capacity. The data is normalised through the preProcess argument. We use repeated cross-validation in order to achieve more reliable results. Also, we balance the classes via sampling = "down", to ensure that the prediction capacity on both classes is balanced. Lastly, we tune the complexity parameter (cp) to further improve the model.
set.seed(12)
hp_ct <- data.frame(cp = seq(from = 0.03, to = 0, by = -0.003))
ct.caret <- train(
RESPONSE ~ .,
data = german.tr,
method = "rpart",
preProcess = c("center", "scale"),
trControl = trainControl(
method = "repeatedcv",
number = 10,
repeats = 10,
verboseIter = FALSE,
sampling = "down"
),
tuneGrid = hp_ct
)

Best Complexity Parameter
ct.caret$bestTune
#> cp
#> 6 0.015
print(ct.caret)
#> CART
#>
#> 750 samples
#> 17 predictor
#> 2 classes: '0', '1'
#>
#> Pre-processing: centered (34), scaled (34)
#> Resampling: Cross-Validated (10 fold, repeated 10 times)
#> Summary of sample sizes: 675, 675, 675, 675, 676, 675, ...
#> Addtional sampling using down-sampling prior to pre-processing
#>
#> Resampling results across tuning parameters:
#>
#> cp Accuracy Kappa
#> 0.000 0.645 0.257
#> 0.003 0.649 0.262
#> 0.006 0.644 0.257
#> 0.009 0.647 0.261
#> 0.012 0.653 0.273
#> 0.015 0.656 0.280
#> 0.018 0.651 0.278
#> 0.021 0.646 0.274
#> 0.024 0.643 0.271
#> 0.027 0.641 0.271
#> 0.030 0.637 0.266
#>
#> Accuracy was used to select the optimal model using the
#> largest value.
#> The final value used for the model was cp = 0.015.
plot(ct.caret, main = "Best Complexity Parameter")

From the information above, we can see that the best Accuracy on the training set is 65.6%, achieved with a complexity parameter of 0.015.
Confusion Matrix
We now look at the performance on the test set.
confusionMatrix(predict.train(ct.caret, newdata = german.te),
german.te$RESPONSE)
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction 0 1
#> 0 54 71
#> 1 28 97
#>
#> Accuracy : 0.604
#> 95% CI : (0.54, 0.665)
#> No Information Rate : 0.672
#> P-Value [Acc > NIR] : 0.99
#>
#> Kappa : 0.208
#>
#> Mcnemar's Test P-Value : 2.43e-05
#>
#> Sensitivity : 0.659
#> Specificity : 0.577
#> Pos Pred Value : 0.432
#> Neg Pred Value : 0.776
#> Prevalence : 0.328
#> Detection Rate : 0.216
#> Detection Prevalence : 0.500
#> Balanced Accuracy : 0.618
#>
#> 'Positive' Class : 0
#>

As we can see, the overall Accuracy on the test set is 60.4%. Even though the prediction on each class is more balanced than if we had not down-sampled, the “0” class is better predicted (Sensitivity of 65.9%) than the “1” class (Specificity of 57.7%). The Confusion Matrix confirms this: the “1” class is correctly predicted 97 times out of 168, whereas the “0” class is correctly predicted 54 times out of 82. This means that the model does relatively better at identifying bad credit candidates. However, if the aim of the project is to uncover good credit candidates based on the explanatory variables, the model will do a poorer job.
Tree drawing
We now plot the final and best model.
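The plotting call does not appear in this chunk; a sketch using the rpart.plot package on the final model selected by caret (assuming rpart.plot is installed; the original figure may have been drawn differently):

```r
library(rpart.plot)
# Draw the pruned classification tree selected by caret (cp = 0.015)
rpart.plot(ct.caret$finalModel, type = 2, extra = 104,
           main = "Final Decision Tree")
```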
Neural Network
The second model we consider is the Neural Network. Similarly to the Decision Tree, we scaled and balanced the data and performed repeated cross-validation in order to achieve better results. Going further, we also tuned the “size” and “decay” hyperparameters.
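The training call for this model does not appear in the chunk; a reconstruction consistent with the printed tuning grid (size 2–4, decay 0–0.5 in steps of 0.05) — the seed value and grid object name are our assumptions:

```r
set.seed(123)  # seed assumed; the original value is not shown
hp_nn <- expand.grid(size = 2:4, decay = seq(0, 0.5, by = 0.05))
credit.nn.caret <- caret::train(
  RESPONSE ~ .,
  data = german.tr,
  method = "nnet",                   # single-hidden-layer network from the nnet package
  preProcess = c("center", "scale"),
  trace = FALSE,
  trControl = trainControl(method = "repeatedcv", number = 10, repeats = 10,
                           verboseIter = FALSE, sampling = "down"),
  tuneGrid = hp_nn
)
```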
Best Parameters
credit.nn.caret$bestTune
#> size decay
#> 33 4 0.5
print(credit.nn.caret)
#> Neural Network
#>
#> 750 samples
#> 17 predictor
#> 2 classes: '0', '1'
#>
#> Pre-processing: centered (34), scaled (34)
#> Resampling: Cross-Validated (10 fold, repeated 10 times)
#> Summary of sample sizes: 674, 675, 675, 675, 675, 675, ...
#> Addtional sampling using down-sampling prior to pre-processing
#>
#> Resampling results across tuning parameters:
#>
#> size decay Accuracy Kappa
#> 2 0.00 0.669 0.305
#> 2 0.05 0.678 0.307
#> 2 0.10 0.679 0.315
#> 2 0.15 0.682 0.324
#> 2 0.20 0.690 0.338
#> 2 0.25 0.689 0.333
#> 2 0.30 0.691 0.340
#> 2 0.35 0.687 0.333
#> 2 0.40 0.688 0.334
#> 2 0.45 0.692 0.346
#> 2 0.50 0.688 0.334
#> 3 0.00 0.667 0.296
#> 3 0.05 0.659 0.282
#> 3 0.10 0.658 0.281
#> 3 0.15 0.661 0.287
#> 3 0.20 0.675 0.315
#> 3 0.25 0.672 0.310
#> 3 0.30 0.676 0.314
#> 3 0.35 0.673 0.311
#> 3 0.40 0.684 0.327
#> 3 0.45 0.683 0.326
#> 3 0.50 0.678 0.313
#> 4 0.00 0.663 0.290
#> 4 0.05 0.660 0.278
#> 4 0.10 0.658 0.274
#> 4 0.15 0.670 0.298
#> 4 0.20 0.677 0.309
#> 4 0.25 0.674 0.303
#> 4 0.30 0.681 0.318
#> 4 0.35 0.687 0.333
#> 4 0.40 0.674 0.305
#> 4 0.45 0.677 0.312
#> 4 0.50 0.693 0.344
#>
#> Accuracy was used to select the optimal model using the
#> largest value.
#> The final values used for the model were size = 4 and decay = 0.5.
plot(credit.nn.caret)

As we can see, the highest Accuracy on the training set (69.3%) is obtained by setting size = 4 and decay = 0.5.
Confusion Matrix
confusionMatrix(predict.train(credit.nn.caret, newdata = german.te),
german.te$RESPONSE)
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction 0 1
#> 0 52 56
#> 1 30 112
#>
#> Accuracy : 0.656
#> 95% CI : (0.594, 0.715)
#> No Information Rate : 0.672
#> P-Value [Acc > NIR] : 0.72939
#>
#> Kappa : 0.278
#>
#> Mcnemar's Test P-Value : 0.00702
#>
#> Sensitivity : 0.634
#> Specificity : 0.667
#> Pos Pred Value : 0.481
#> Neg Pred Value : 0.789
#> Prevalence : 0.328
#> Detection Rate : 0.208
#> Detection Prevalence : 0.432
#> Balanced Accuracy : 0.650
#>
#> 'Positive' Class : 0
#>

The results on the test set are slightly better than for the Decision Tree: overall, we obtain an Accuracy of 65.6%. The prediction of each class is fairly balanced, although here the “1” class (Specificity of 66.7%) is better predicted than the “0” class (Sensitivity of 63.4%), unlike for the Decision Tree model.
The Confusion Matrix confirms these results: the “1” class is correctly predicted 112 times out of 168, whereas the “0” class is correctly predicted 52 times out of 82. Thus, the Neural Network outperforms the Decision Tree in predicting the “1” class, while the results are almost equivalent for the “0” class.
Logistic Regression
The third model we consider is the Logistic Regression, a standard and powerful classifier for binary outcomes.
set.seed(1)
credit.log.caret <-
train(
form = RESPONSE ~ .,
data = german.tr,
method = "glm",
preProcess = c("center", "scale"),
trControl = trainControl(
method = "repeatedcv",
number = 10,
repeats = 10,
verboseIter = FALSE,
sampling = "down"
)
)

Accuracy measure
print(credit.log.caret)
#> Generalized Linear Model
#>
#> 750 samples
#> 17 predictor
#> 2 classes: '0', '1'
#>
#> Pre-processing: centered (34), scaled (34)
#> Resampling: Cross-Validated (10 fold, repeated 10 times)
#> Summary of sample sizes: 676, 675, 675, 675, 674, 675, ...
#> Addtional sampling using down-sampling prior to pre-processing
#>
#> Resampling results:
#>
#> Accuracy Kappa
#> 0.709 0.375
summary(credit.log.caret)
#>
#> Call:
#> NULL
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> -2.5416 -0.7706 -0.0075 0.7741 2.4398
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) 0.02113 0.12469 0.17 0.86542
#> CHK_ACCT1 0.19508 0.14609 1.34 0.18176
#> CHK_ACCT2 0.09360 0.11816 0.79 0.42827
#> CHK_ACCT3 0.82072 0.15892 5.16 2.4e-07 ***
#> DURATION -0.66454 0.19087 -3.48 0.00050 ***
#> HISTORY1 -0.32239 0.22461 -1.44 0.15119
#> HISTORY2 -0.11774 0.36205 -0.33 0.74502
#> HISTORY3 0.11618 0.23238 0.50 0.61710
#> HISTORY4 0.57492 0.33601 1.71 0.08708 .
#> AMOUNT -0.28449 0.20718 -1.37 0.16970
#> SAV_ACCT1 0.22613 0.13634 1.66 0.09720 .
#> SAV_ACCT2 0.02273 0.12530 0.18 0.85605
#> SAV_ACCT3 0.42695 0.15181 2.81 0.00492 **
#> SAV_ACCT4 0.47704 0.13591 3.51 0.00045 ***
#> GUARANTOR1 0.34802 0.13005 2.68 0.00745 **
#> OTHER_INSTALL1 -0.28149 0.12943 -2.17 0.02964 *
#> EMPLOYMENT1 0.56220 0.27234 2.06 0.03899 *
#> EMPLOYMENT2 0.86035 0.33278 2.59 0.00973 **
#> EMPLOYMENT3 0.86778 0.26890 3.23 0.00125 **
#> EMPLOYMENT4 0.79507 0.29741 2.67 0.00751 **
#> AGE 0.18040 0.15439 1.17 0.24262
#> USED_CAR1 0.30533 0.14220 2.15 0.03178 *
#> REAL_ESTATE1 0.18677 0.13576 1.38 0.16889
#> PROP_UNKN_NONE1 -0.11630 0.15989 -0.73 0.46700
#> NEW_CAR1 -0.14202 0.12861 -1.10 0.26948
#> OWN_RES1 0.19616 0.14084 1.39 0.16368
#> INSTALL_RATE2 -0.22620 0.18269 -1.24 0.21564
#> INSTALL_RATE3 -0.27500 0.17953 -1.53 0.12556
#> INSTALL_RATE4 -0.44619 0.21778 -2.05 0.04049 *
#> JOB1 -0.26772 0.44733 -0.60 0.54951
#> JOB2 -0.21575 0.53794 -0.40 0.68837
#> JOB3 0.00668 0.41866 0.02 0.98726
#> NUM_CREDITS2 -0.34462 0.17566 -1.96 0.04978 *
#> NUM_CREDITS3 -0.18821 0.13829 -1.36 0.17353
#> NUM_CREDITS4 -0.11432 0.12775 -0.89 0.37086
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 604.42 on 435 degrees of freedom
#> Residual deviance: 415.47 on 401 degrees of freedom
#> AIC: 485.5
#>
#> Number of Fisher Scoring iterations: 5

The results of the training show that not all variables significantly impact the RESPONSE variable. For instance, according to the model, variables such as JOB or NEW_CAR do not have a significant impact. Other variables are significant at different thresholds: CHK_ACCT3 is significant at the 0.1% level, whereas EMPLOYMENT4, INSTALL_RATE4 and HISTORY4 are significant at the 1%, 5% and 10% levels, respectively.
Overall, the model presents a good Accuracy (70.9%) on the training set. However, to truly evaluate its prediction capacity we would need to consider the Accuracy on the test set.
Confusion Matrix
confusionMatrix(predict.train(credit.log.caret, newdata = german.te),
german.te$RESPONSE)
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction 0 1
#> 0 60 56
#> 1 22 112
#>
#> Accuracy : 0.688
#> 95% CI : (0.627, 0.745)
#> No Information Rate : 0.672
#> P-Value [Acc > NIR] : 0.320846
#>
#> Kappa : 0.36
#>
#> Mcnemar's Test P-Value : 0.000187
#>
#> Sensitivity : 0.732
#> Specificity : 0.667
#> Pos Pred Value : 0.517
#> Neg Pred Value : 0.836
#> Prevalence : 0.328
#> Detection Rate : 0.240
#> Detection Prevalence : 0.464
#> Balanced Accuracy : 0.699
#>
#> 'Positive' Class : 0
#>

The analysis of the predictions on the test set reveals that this is also a good model, with an overall Accuracy of 68.8%. This means that, overall, the Logistic Regression outperforms both the Decision Tree and the Neural Network.
Looking at the details, both classes are reasonably well predicted. Similarly to the Decision Tree model, the “0” class (Sensitivity of 73.2%) is better predicted than the “1” class (Specificity of 66.7%). The Confusion Matrix confirms this: the “1” class is correctly predicted 112 times out of 168, whereas the “0” class is correctly predicted 60 times out of 82.
Thus, for predicting the “1” class, the Logistic Regression performs better than the Decision Tree and equals the Neural Network. For the “0” class, it outperforms both the Neural Network and the Decision Tree. In other words, considering only the three models seen so far, the Neural Network and the Logistic Regression are the better choices for finding good credit candidates, while the Logistic Regression should be preferred if the aim is to detect bad credit candidates. Overall, this confirms that the Logistic Regression does a better job than both the Neural Network and the Decision Tree.
Let’s look further into other models.
Discriminant Analysis
The fourth model considered is the Discriminant Analysis. First, we are going to use a Linear Discriminant Analysis (LDA), then a Quadratic Discriminant Analysis (QDA).
LDA
set.seed(4567)
lda.fit <- caret::train(RESPONSE ~ .,
data=german.tr,
method="lda",
preProcess=c("center", "scale"),
trControl=trainControl(method="repeatedcv", number=10,
repeats=10, verboseIter=FALSE, sampling = "down")
)

After having trained the model, we are going to make predictions on the test set and look at the results.
Confusion Matrix
lda.pred <- predict(lda.fit, GermanCredit)
confusionMatrix(predict.train(lda.fit, newdata = german.te),
german.te$RESPONSE)
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction 0 1
#> 0 60 57
#> 1 22 111
#>
#> Accuracy : 0.684
#> 95% CI : (0.622, 0.741)
#> No Information Rate : 0.672
#> P-Value [Acc > NIR] : 0.370775
#>
#> Kappa : 0.354
#>
#> Mcnemar's Test P-Value : 0.000131
#>
#> Sensitivity : 0.732
#> Specificity : 0.661
#> Pos Pred Value : 0.513
#> Neg Pred Value : 0.835
#> Prevalence : 0.328
#> Detection Rate : 0.240
#> Detection Prevalence : 0.468
#> Balanced Accuracy : 0.696
#>
#> 'Positive' Class : 0
#>

One can observe that the Accuracy is 68.4%. So far, this is a good model.
Looking into more detail, the Sensitivity and the Specificity are 73.2% and 66.1%, respectively. The model correctly predicted a bad credit 60 times out of 82, and a good credit 111 times out of 168. Compared with the other models, the LDA has a prediction capacity similar to that of the Logistic Regression for the “1” class (the good credits), as well as for the “0” class.
QDA
We are now going to build a model using a Quadratic Discriminant Analysis.
set.seed(1)
qda.fit <- caret::train(RESPONSE ~ .,
data=german.tr,
method="qda",
preProcess=c("center", "scale"),
trControl=trainControl(method="repeatedcv", number=10,
repeats=10, verboseIter=FALSE, sampling = "down")
)
qda.fit
#> Quadratic Discriminant Analysis
#>
#> 750 samples
#> 17 predictor
#> 2 classes: '0', '1'
#>
#> Pre-processing: centered (34), scaled (34)
#> Resampling: Cross-Validated (10 fold, repeated 10 times)
#> Summary of sample sizes: 676, 675, 675, 675, 674, 675, ...
#> Addtional sampling using down-sampling prior to pre-processing
#>
#> Resampling results:
#>
#> Accuracy Kappa
#> 0.683 0.311

During training, the Accuracy is 68.3%, which is lower than for the Linear Discriminant Analysis model. We are still going to see how the model performs on the test set.
Confusion Matrix
qda.pred <- predict(qda.fit, GermanCredit)
confusionMatrix(predict.train(qda.fit, newdata = german.te),
german.te$RESPONSE)
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction 0 1
#> 0 50 40
#> 1 32 128
#>
#> Accuracy : 0.712
#> 95% CI : (0.652, 0.767)
#> No Information Rate : 0.672
#> P-Value [Acc > NIR] : 0.0994
#>
#> Kappa : 0.363
#>
#> Mcnemar's Test P-Value : 0.4094
#>
#> Sensitivity : 0.610
#> Specificity : 0.762
#> Pos Pred Value : 0.556
#> Neg Pred Value : 0.800
#> Prevalence : 0.328
#> Detection Rate : 0.200
#> Detection Prevalence : 0.360
#> Balanced Accuracy : 0.686
#>
#> 'Positive' Class : 0
#> When the model is fitted on the test set, the Accuracy is of 71.2%, which is the highest level achieved so far. However when looking at the Sensitivity, one can remark that the it has a low value (60.98%). The model is able has predicted 50 bad credits out of 82, making him the worst in terms of prediction capability of the class “0”. On the other hand, the Specificity is of 76.19%, the highest value achieved for the “1” class.
Hence, we can say that the model is very good at predicting the good credits ("1") but very bad at predicting the bad ones. Even if the Accuracy is high, it is preferable to use a model that predicts both classes well.
Support Vector Machine I
The next model is the Support Vector Machine. We will first use a radial kernel, then a polynomial one.
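For reference, the radial kernel tuned below is k(x, y) = exp(-sigma * ||x - y||^2); a small sketch with kernlab (the engine behind caret's "svmRadial") shows the definition in action. The vectors here are made up purely for illustration:

```r
library(kernlab)  # engine used by caret's "svmRadial"
k <- rbfdot(sigma = 0.01)
u <- c(1, 2); v <- c(3, 4)
k(u, v)                      # kernel value between u and v
exp(-0.01 * sum((u - v)^2))  # the same value by definition, ~0.923
```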
library(doParallel) # use the parallel backend
cl<-makeCluster(detectCores()) # detect and create a cluster
registerDoParallel(cl)
C <- c(0.25, 0.1, 0.5, 1, 10, 100)
sigma <- c(0.0001, 0.001, 0.01, 0.1, 1)
gr.radial<-expand.grid(C = C, sigma = sigma)
set.seed(123)
system.time(model_svm<-caret::train(RESPONSE ~ .,
data = german.tr,
method = "svmRadial",
preProcess = "range",
trace=FALSE,
trControl = trainControl(method = "repeatedcv", number = 10, repeats = 10,
verboseIter = FALSE, sampling = "down"),
tuneGrid=gr.radial))
#> user system elapsed
#> 5.48 2.92 200.33
model_svm
#> Support Vector Machines with Radial Basis Function Kernel
#>
#> 750 samples
#> 17 predictor
#> 2 classes: '0', '1'
#>
#> Pre-processing: re-scaling to [0, 1] (34)
#> Resampling: Cross-Validated (10 fold, repeated 10 times)
#> Summary of sample sizes: 675, 675, 675, 674, 675, 676, ...
#> Addtional sampling using down-sampling prior to pre-processing
#>
#> Resampling results across tuning parameters:
#>
#> C sigma Accuracy Kappa
#> 0.10 1e-04 0.745 0.37806
#> 0.10 1e-03 0.739 0.37703
#> 0.10 1e-02 0.698 0.36511
#> 0.10 1e-01 0.688 0.24282
#> 0.10 1e+00 0.531 -0.00603
#> 0.25 1e-04 0.745 0.37428
#> 0.25 1e-03 0.744 0.38196
#> 0.25 1e-02 0.688 0.35185
#> 0.25 1e-01 0.687 0.24421
#> 0.25 1e+00 0.542 -0.01159
#> 0.50 1e-04 0.743 0.36487
#> 0.50 1e-03 0.740 0.38867
#> 0.50 1e-02 0.690 0.35079
#> 0.50 1e-01 0.710 0.32079
#> 0.50 1e+00 0.564 -0.00680
#> 1.00 1e-04 0.745 0.37358
#> 1.00 1e-03 0.697 0.36213
#> 1.00 1e-02 0.694 0.36015
#> 1.00 1e-01 0.680 0.30568
#> 1.00 1e+00 0.500 0.02753
#> 10.00 1e-04 0.688 0.35042
#> 10.00 1e-03 0.692 0.35993
#> 10.00 1e-02 0.695 0.35357
#> 10.00 1e-01 0.670 0.28632
#> 10.00 1e+00 0.476 0.02852
#> 100.00 1e-04 0.694 0.36208
#> 100.00 1e-03 0.697 0.35582
#> 100.00 1e-02 0.664 0.28904
#> 100.00 1e-01 0.672 0.29221
#> 100.00 1e+00 0.453 0.02320
#>
#> Accuracy was used to select the optimal model using the
#> largest value.
#> The final values used for the model were sigma = 1e-04 and C = 0.25.
model_svm$bestTune
#> sigma C
#> 6 1e-04 0.25
plot(model_svm)
As the model summary shows, and as bestTune and the plot confirm, the model fits the data best when sigma = 0.0001 and the Cost C = 0.25. The Cost is very small, which means the margins will be very large. The risk of such a small Cost is that the model under-fits; if that is the case, it will not predict the outcome variable well on the test set. We can look at the Confusion Matrix and see what happens.
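The link between a small cost and wide margins can be illustrated with kernlab directly. This is a toy sketch on simulated data (not the German Credit set): a smaller C typically leaves more points inside the margin and hence more support vectors.

```r
library(kernlab)
set.seed(1)
x <- matrix(rnorm(200), ncol = 2)
y <- factor(ifelse(x[, 1] + x[, 2] > 0, "1", "0"))
fit_small <- ksvm(x, y, kernel = "rbfdot", C = 0.25)  # wide margin
fit_large <- ksvm(x, y, kernel = "rbfdot", C = 100)   # narrow margin
c(small_C = nSV(fit_small), large_C = nSV(fit_large))
```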
Confusion Matrix
confusionMatrix(predict.train(model_svm, newdata = german.te),
german.te$RESPONSE)
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction 0 1
#> 0 23 16
#> 1 59 152
#>
#> Accuracy : 0.7
#> 95% CI : (0.639, 0.756)
#> No Information Rate : 0.672
#> P-Value [Acc > NIR] : 0.191
#>
#> Kappa : 0.214
#>
#> Mcnemar's Test P-Value : 1.24e-06
#>
#> Sensitivity : 0.280
#> Specificity : 0.905
#> Pos Pred Value : 0.590
#> Neg Pred Value : 0.720
#> Prevalence : 0.328
#> Detection Rate : 0.092
#> Detection Prevalence : 0.156
#> Balanced Accuracy : 0.593
#>
#> 'Positive' Class : 0
#>
The model's Accuracy is 70%. However, looking at the Sensitivity and the Specificity, we realise that it is unable to balance the classes despite having set sampling = "down". The Sensitivity is very low (28.05%): the model correctly predicts the "0" class only 23 times out of 82. On the other hand, the model is very good at predicting the "1" class, achieving a Specificity of 90.48%.
This means that the model is very good at predicting good creditors but does a very poor job at predicting bad ones. In light of the above, we can confirm that the model is under-fitting: it is unable to predict the "0" class correctly.
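The sampling = "down" option we relied on here calls caret's downSample() inside each resample; a self-contained sketch on a toy imbalanced factor shows the effect:

```r
library(caret)
set.seed(42)
y <- factor(c(rep("0", 30), rep("1", 70)))  # imbalanced, like our classes
x <- data.frame(v = rnorm(100))
bal <- downSample(x = x, y = y)
table(bal$Class)  # majority class cut down to the minority count: 30 / 30
```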
Support Vector Machine II
After having fitted a first Support Vector Machine model, we are now going to fit another one using “svmPoly” and see if the results improve.
C <- c(0.1, 1, 10, 100)
degree <- c(1, 2, 3)
scale <- 1
gr.poly <- expand.grid(C = C, degree = degree, scale = scale)
ctrl <- trainControl(method = "cv",
number = 3,
sampling = "down")
set.seed(123)
system.time(
model_svm_poly <-
train(
RESPONSE ~ .,
data = german.tr,
method = "svmPoly",
trControl = ctrl,
tuneGrid = gr.poly
)
)
#> user system elapsed
#> 1.726 0.043 33.216
model_svm_poly
#> Support Vector Machines with Polynomial Kernel
#>
#> 750 samples
#> 17 predictor
#> 2 classes: '0', '1'
#>
#> No pre-processing
#> Resampling: Cross-Validated (3 fold)
#> Summary of sample sizes: 499, 501, 500
#> Addtional sampling using down-sampling
#>
#> Resampling results across tuning parameters:
#>
#> C degree Accuracy Kappa
#> 0.1 1 0.743 0.419
#> 0.1 2 0.633 0.233
#> 0.1 3 0.632 0.187
#> 1.0 1 0.699 0.364
#> 1.0 2 0.607 0.175
#> 1.0 3 0.510 0.123
#> 10.0 1 0.711 0.342
#> 10.0 2 0.597 0.170
#> 10.0 3 0.534 0.153
#> 100.0 1 0.684 0.329
#> 100.0 2 0.619 0.219
#> 100.0 3 0.632 0.236
#>
#> Tuning parameter 'scale' was held constant at a value of 1
#> Accuracy was used to select the optimal model using the
#> largest value.
#> The final values used for the model were degree = 1, scale = 1 and
#> C = 0.1.
model_svm_poly$bestTune
#> degree scale C
#> 1 1 1 0.1
plot(model_svm_poly)
This Support Vector Machine model tunes not only the Cost parameter but also the Scale and the Degree. The model summary, the bestTune output, and the plot all show that the model gives the best results when Degree = 1, Scale = 1, and C = 0.1. We are now going to analyse the Confusion Matrix and see whether the model's prediction capacity is good.
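A degree-1 polynomial kernel is, up to the offset, just a linear kernel, which is worth keeping in mind when interpreting the tuned model. A quick check with kernlab's polydot (the kernel behind "svmPoly"), on made-up vectors:

```r
library(kernlab)
k <- polydot(degree = 1, scale = 1, offset = 1)
u <- c(1, 2); v <- c(3, 4)
k(u, v)         # (1 * sum(u * v) + 1)^1
sum(u * v) + 1  # = 12, identical: a shifted linear kernel
```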
Confusion Matrix
confusionMatrix(predict.train(model_svm_poly, newdata = german.te),
german.te$RESPONSE)
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction 0 1
#> 0 56 57
#> 1 26 111
#>
#> Accuracy : 0.668
#> 95% CI : (0.606, 0.726)
#> No Information Rate : 0.672
#> P-Value [Acc > NIR] : 0.582939
#>
#> Kappa : 0.313
#>
#> Mcnemar's Test P-Value : 0.000991
#>
#> Sensitivity : 0.683
#> Specificity : 0.661
#> Pos Pred Value : 0.496
#> Neg Pred Value : 0.810
#> Prevalence : 0.328
#> Detection Rate : 0.224
#> Detection Prevalence : 0.452
#> Balanced Accuracy : 0.672
#>
#> 'Positive' Class : 0
#>
This Support Vector Machine model has an Accuracy of 66.8%, a bit lower than the radial one. However, even though it does a poorer job overall, it predicts both classes in a more balanced way. It is slightly better at predicting the bad creditors ("0") than the good ones: it correctly predicts bad creditors 56 times out of 82 and good creditors 111 times out of 168.
Ensemble Methods (Random Forest)
The last model we are going to analyse is the Random Forest.
set.seed(1994)
modelLookup(model="rf")
#> model parameter label forReg forClass
#> 1 rf mtry #Randomly Selected Predictors TRUE TRUE
#> probModel
#> 1 TRUE
model_rf <- caret::train(RESPONSE ~ .,
data=german.tr,
method="rf",
preProcess=c("center", "scale"),
trControl=trainControl(method="repeatedcv", number=10,
repeats=10, verboseIter=FALSE, sampling="down")
)
model_rf
#> Random Forest
#>
#> 750 samples
#> 17 predictor
#> 2 classes: '0', '1'
#>
#> Pre-processing: centered (34), scaled (34)
#> Resampling: Cross-Validated (10 fold, repeated 10 times)
#> Summary of sample sizes: 675, 675, 675, 675, 674, 675, ...
#> Addtional sampling using down-sampling prior to pre-processing
#>
#> Resampling results across tuning parameters:
#>
#> mtry Accuracy Kappa
#> 2 0.690 0.360
#> 18 0.691 0.352
#> 34 0.685 0.342
#>
#> Accuracy was used to select the optimal model using the
#> largest value.
#> The final value used for the model was mtry = 18.
The output shows that the best Accuracy is reached when mtry = 18. The model ran resampling with different values of mtry (the number of variables randomly sampled as candidates at each split) and ended up fitting the model to the training set with mtry = 18.
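The three candidate values 2, 18, and 34 are not arbitrary: caret's default grid for "rf" spaces the mtry candidates evenly from 2 up to the number of predictors. Assuming caret's var_seq helper (the function caret uses internally to build this grid), the spacing can be reproduced for our 34 dummy-coded predictors:

```r
# caret's default rf grid: 'len' mtry values evenly spaced from 2 to p.
# With p = 34 dummy-coded predictors this gives the 2, 18, 34 seen above.
caret::var_seq(p = 34, classification = TRUE, len = 3)
```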
We are now going to look at how the model behaves when predicting the output from the test set.
Confusion Matrix
confusionMatrix(predict.train(model_rf, newdata = german.te),
german.te$RESPONSE)
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction 0 1
#> 0 62 56
#> 1 20 112
#>
#> Accuracy : 0.696
#> 95% CI : (0.635, 0.752)
#> No Information Rate : 0.672
#> P-Value [Acc > NIR] : 0.23
#>
#> Kappa : 0.38
#>
#> Mcnemar's Test P-Value : 5.95e-05
#>
#> Sensitivity : 0.756
#> Specificity : 0.667
#> Pos Pred Value : 0.525
#> Neg Pred Value : 0.848
#> Prevalence : 0.328
#> Detection Rate : 0.248
#> Detection Prevalence : 0.472
#> Balanced Accuracy : 0.711
#>
#> 'Positive' Class : 0
#>
The model's Accuracy is 69.6%. Looking at the Sensitivity and the Specificity, one can note that the model is better at predicting the bad credits ("0"): the Sensitivity is 75.6% while the Specificity is 66.7%. With this random forest, the model predicts good credits 112 times out of 168 and bad credits 62 times out of 82. Overall, the model is satisfying, as it is able to predict both classes fairly well.
Generally, to obtain even better results we could carry out a variable importance analysis. However, having used the Boruta algorithm at the beginning of our analysis, this step is no longer necessary.
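Had Boruta not been used, such an analysis could be sketched as follows, assuming the fitted model_rf object from the chunk above is still available; varImp() is caret's generic importance extractor:

```r
library(caret)
vi <- varImp(model_rf)  # scaled importance of the dummy-coded predictors
plot(vi, top = 10)      # the ten most important variables
```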
Conclusion
In a nutshell, we have observed that most of the models predict both classes quite well. Considering the top 3 in terms of overall Accuracy, the ones that stand out are the QDA model (71.2%), followed by the radial SVM model (70.0%) and the Random Forest model (69.6%).
However, depending on the aim of the project, the choice of model may differ. If the goal is to predict good credit candidates, one should consider the models with the highest Specificity rates; these are the same three models, in a different order: the radial SVM (90.48%), followed by the QDA model (76.19%) and the Random Forest model (66.7%).
On the other hand, if the aim of the project is to detect and predict bad creditors, the models to consider are the ones with the highest Sensitivity rates: the Random Forest model (75.61%), followed by the Logistic Regression and LDA models (both at 73.2%).
In practice, it is better to identify the bad creditors in order to avoid lending money to people who will not pay back. Hence, the model to use for deciding whether a client is a good or a bad creditor is the one with the highest Sensitivity: the Random Forest.
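A natural follow-up, sketched below under the assumption that the model_rf and german.te objects from the analysis above are still in the session, is to move the probability threshold instead of switching models: raising the probability required to call an applicant "good" flags more applicants as bad, trading Specificity for even higher Sensitivity.

```r
# Threshold tuning on the random forest's class probabilities.
# The 0.6 cut-off is an illustrative choice, not a tuned value.
prob <- predict(model_rf, newdata = german.te, type = "prob")
pred <- factor(ifelse(prob[, "1"] > 0.6, "1", "0"), levels = c("0", "1"))
confusionMatrix(pred, german.te$RESPONSE)  # compare with the default 0.5
```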